Add client identifier header to HTTP requests #1075

Merged
ericmj merged 9 commits into hexpm:main from sorentwo:client-identifier on Sep 24, 2025

@sorentwo
Contributor

@sorentwo sorentwo commented Jun 2, 2025

Creates an anonymized repository identifier based on the SHA of the first commit, hashed with SHA256 for additional privacy. The identifier is sent as an x-hex-client-id header when available.

For example, in the hex repo itself:

iex(1)> Hex.Utils.client_identifier()
"0f3d954cda05af53f8921bc77be8d2fe1697daded03115fd17b7a3aef32b23b1"

The value can then be used by hex.pm and private hex servers to more accurately count requests. That will give a clearer picture of how many unique applications are using a particular package, rather than just the raw number of downloads.
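
A minimal sketch of how such an identifier could be derived, assuming the first commit is found with `git rev-list --max-parents=0 HEAD`. The module name `ClientIdSketch`, its function names, and the choice of git command are illustrative assumptions, not the code merged in this PR:

```elixir
defmodule ClientIdSketch do
  # Hypothetical module; it only illustrates the idea described above.

  # Returns a lowercase hex SHA256 digest of the root commit SHA,
  # or nil when git is missing or `dir` is not inside a repository.
  def client_identifier(dir \\ File.cwd!()) do
    case first_commit_sha(dir) do
      {:ok, sha} -> hash(sha)
      :error -> nil
    end
  end

  # Hash the commit SHA again so the raw SHA never leaves the machine.
  def hash(sha) do
    :crypto.hash(:sha256, sha) |> Base.encode16(case: :lower)
  end

  # Add the header only when an identifier is available.
  def put_header(headers, nil), do: headers
  def put_header(headers, id), do: Map.put(headers, "x-hex-client-id", id)

  defp first_commit_sha(dir) do
    # Assumption: root commits are listed with --max-parents=0.
    args = ["rev-list", "--max-parents=0", "HEAD"]

    # stderr_to_stdout keeps git's error text off the user's terminal.
    case System.cmd("git", args, cd: dir, stderr_to_stdout: true) do
      {output, 0} ->
        # A repository may have several root commits; take the first line.
        case String.split(output, "\n", trim: true) do
          [sha | _] -> {:ok, sha}
          [] -> :error
        end

      _non_zero_exit ->
        :error
    end
  rescue
    # System.cmd raises ErlangError when the git executable is not found.
    ErlangError -> :error
  end
end
```

Because the input is the root commit, every clone of the same repository would produce the same value, which is what lets a server count distinct projects rather than raw downloads.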


Some followup steps if this approach is accepted:

  • Fix test failures in CI from environment mismatch
  • Potentially cache the value to prevent repeated system calls
  • Add a flag to disable sending the client id

/cc @wojtekmach

@sorentwo sorentwo force-pushed the client-identifier branch from 3839199 to 95452f7 Compare June 2, 2025 14:45
sorentwo added 2 commits June 9, 2025 11:04
The new variable allows users to opt out of identification by setting
the value to anything other than `1` or `true`. It remains enabled by
default.
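
The opt-out rule in that commit message can be sketched as follows. The module `OptOutSketch` is hypothetical, and the variable name is passed in as an argument because the thread later debates what the variable is actually called:

```elixir
defmodule OptOutSketch do
  # Hypothetical helper illustrating the rule from the commit message:
  # enabled by default, disabled by any value other than "1" or "true".

  # Reads the named environment variable and applies the rule.
  def enabled?(var_name) when is_binary(var_name) do
    var_name |> System.get_env() |> enabled_value?()
  end

  # nil means the variable is unset, so the default (enabled) applies.
  def enabled_value?(nil), do: true
  def enabled_value?("1"), do: true
  def enabled_value?("true"), do: true
  def enabled_value?(_other), do: false
end
```

Under this reading, an empty string also disables the identifier, since it is "anything other than" the two accepted values.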
lib/hex/utils.ex Outdated
- The current directory isn't within a git repository
"""
def repo_identifier do
with :unset <- Process.get(:hex_repo_identifier, :unset),
Contributor Author

It's not clear how useful this will be in practice, as I imagine fetching deps is parallelized across multiple processes. However, the codebase appears to support back to Elixir v1.6, hence OTP 19, which is before persistent_term was available and we can't cache between processes very easily.

What does everybody think? Is this useful? Is it worth looking at ETS for caching?
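
To make the concern concrete, here is a small sketch of process-dictionary caching. The module name and the key `:repo_identifier_sketch` are hypothetical; only the `Process.get(key, :unset)` pattern comes from the snippet under review:

```elixir
defmodule ProcDictCacheSketch do
  # Hypothetical module showing why a process-dictionary cache only
  # helps within a single process.

  @key :repo_identifier_sketch

  # Runs `compute` at most once per calling process.
  def fetch(compute) when is_function(compute, 0) do
    case Process.get(@key, :unset) do
      :unset ->
        value = compute.()
        Process.put(@key, value)
        value

      cached ->
        cached
    end
  end
end
```

Each process has its own dictionary, so a task spawned to fetch a dependency would miss the cache and run `compute` again, which is the limitation raised in the comment above.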

Member

Could you look into using Hex.State (an agent) for storage? We even have some affordances there for env variables, but I'm not sure they'd help in this particular case, i.e. I don't think we need to be able to set a custom repo id via env.

Contributor Author

Thanks for the tip. I switched to Hex.State to check whether the identifier is enabled or disabled, as well as caching the value, which is far more convenient.

Do you think they should be combined into a single repo_identifier value that's either nil, false, or the cached binary?

@sorentwo sorentwo force-pushed the client-identifier branch from c8c4cf8 to 4e1cc84 Compare June 11, 2025 14:47
Prevents fetching the same identifier on every client call by caching
the value in the current process dictionary. While the git command isn't
particularly slow, this should avoid spawning during repeated http
calls.
@sorentwo sorentwo force-pushed the client-identifier branch from 4e1cc84 to ffb5c02 Compare June 11, 2025 14:54
Switch from a raw `System.get_env` to using the normalized state
agent. Also prevent STDERR from leaking when called outside of a git
repository.
@sorentwo sorentwo force-pushed the client-identifier branch from 15ea8ff to 100e500 Compare June 16, 2025 14:19
@sorentwo sorentwo force-pushed the client-identifier branch from dd07cd7 to 34d84bc Compare June 16, 2025 14:30
@zachdaniel

I'd love to see this make it in, would be very impactful information for a lot of libraries and frameworks. Anything I can do to help?

Member

@ericmj ericmj left a comment

Is this considered PII and how do we need to inform users about this change other than a note in the changelog?

Other tracking we've talked about adding is:

  • Session ID header, which would be a unique ID for each mix deps.get
  • A header if the env var CI is set, to help identify real users vs CI runs

The repo identifier value is now cached with an agent, which also
serializes access to fetch the state. This also removes the value from
`Hex.State`, which was previously used for caching but is ultimately
unnecessary.
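
A rough sketch of what agent-backed caching with serialized access might look like. `AgentCacheSketch` and its functions are hypothetical names and do not reflect the module merged here:

```elixir
defmodule AgentCacheSketch do
  # Hypothetical module: one agent owns the cached value, so concurrent
  # callers are served one at a time and the value is computed once.
  use Agent

  # `compute` is a zero-arity function producing the identifier (or nil).
  def start_link(compute) when is_function(compute, 0) do
    Agent.start_link(fn -> {:unset, compute} end, name: __MODULE__)
  end

  # The first caller triggers the computation inside the agent process;
  # later callers receive the cached value.
  def fetch do
    Agent.get_and_update(__MODULE__, fn
      {:unset, compute} ->
        value = compute.()
        {value, {:cached, value}}

      {:cached, value} = state ->
        {value, state}
    end)
  end
end
```

Since the computation runs inside the agent, parallel dependency fetches would queue behind a single git call instead of each spawning their own.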
@sorentwo
Contributor Author

@ericmj Took me a while, but I've addressed both of your requests.

@ericmj ericmj merged commit 1264476 into hexpm:main Sep 24, 2025
8 checks passed
@ericmj
Member

ericmj commented Sep 24, 2025

Thank you @sorentwo 💜


Returns `nil` when:

- The `HEX_REPO_IDENTIFIER` environment variable is set to anything other than `1` or `true`

Maybe it should be “HEX_NO_REPO_IDENTIFIER”?

Contributor Author

Either version is fine by me (HEX_REPO_IDENTIFIER=false or HEX_NO_REPO_IDENTIFIER=true)

I mean, the doc here says: HEX_REPO_IDENTIFIER

But the code is actually using NO_HEX_REPO_IDENTIFIER:

https://github.com/sorentwo/hex/blob/5e329783583aa4d5929953ce95a73d68590406d2/lib/hex/state.ex#L115-L116

So maybe the docs should be changed to NO_HEX_REPO_IDENTIFIER.

Or, I'm missing something. 😅

Contributor Author

My mistake! This PR drifted a while and I completely forgot about the name change. Glad @ericmj took care of it =)

@ericmj
Member

ericmj commented Oct 24, 2025

We need to ask for consent before sending this telemetry according to GDPR because it creates a persistent identifier and is not required for the functionality of Hex.

Since it has limited usefulness when it's opt-in I am inclined to remove it.

cc @sorentwo

@hugobarauna

> We need to ask for consent before sending this telemetry according to GDPR because it creates a persistent identifier and is not required for the functionality of Hex.
>
> Since it has limited usefulness when it's opt-in I am inclined to remove it.
>
> cc @sorentwo

@ericmj I’m not sure this identifier would fall under GDPR.

It’s a git repo–scoped value, not tied to any user or machine. Different people or machines using the same repo would generate the same identifier, reinforcing that it’s linked to the project/git repo, not to a person.

I guess GDPR only applies to data that can identify a natural person, not to identifiers of software artifacts like a Git repository. If it were a machine-specific identifier (like a device ID, IP, or cookie), then it could make a person indirectly identifiable — that’s when GDPR applies. But this one isn’t, so it should be outside GDPR’s scope.

But I'm not an expert on GDPR. Only saying that based on my research and personal experience.

@ericmj
Member

ericmj commented Oct 24, 2025

@hugobarauna I see your point, but it's enough of an edge case that I think the risk is not worth it.

In most cases a repository is not identifiable to a single person, but I could imagine cases where it could be argued that a repository identifier is tied to a single person. For example, if you start working on a new project for a month before publishing it to GitHub under your username, any mix deps.get HTTP requests for that repository before you published it could be attributed to you with reasonable confidence.
